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Abstract — We propose a construction for joint feature learning 
and clustering of multichannel extracellular electrophysiological 
data across multiple recording periods for action potential 
detection and discrimination ("spike sorting"). Our construction 
improves over the previous state-of-the art principally in four 
ways. First, via sharing information across channels, we can bet- 
ter distinguish between single-unit spikes and artifacts. Second, 
our proposed "focused mixture model" (FMM) elegantly deals 
with units appearing, disappearing, or reappearing over multiple 
recording days, an important consideration for any chronic 
experiment. Third, by jointly learning features and clusters, we 
improve performance over previous attempts that proceeded via 
a two-stage ("frequentist") learning process. Fourth, by directly 
modeling spike rate, we improve detection of sparsely spiking 
neurons. Moreover, our Bayesian construction seamlessly handles 
missing data. We present state-of-the-art performance without 
requiring manually tuning of many hyper-parameters on both a 
public dataset with partial ground truth and a new experimental 
dataset. 
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I. Introduction 

SPIKE sorting of extracellular electrophysiological data is 
an important problem in contemporary neuroscience, with 
applications ranging from brain-machine interfaces pT| to 
neural coding |23J and beyond. Despite a rich history of work 
in this area ||10t, f32), room for improvement remains for auto- 
matic methods. An ideal tool for spike sorting achieves state- 
of-the-art performance and satisfies the following desiderate: 

1) is fully automatic, obviating the need for the user to 
manually tune many "hyperparameters", especially the 
number of single-units, 

2) benefits from multiple electrodes, multiple "sessions", 
and generally, more data, 

3) is robust to artifactual noise, due to movement, for 
example, 

4) handles "missing data", 

5) copes with changes in waveform shape over many 
days/weeks of recording, and 

6) captures sparsely firing neurons. 
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Here we propose a Bayesian generative model and asso- 
ciated inference procedure; the first, to our knowledge, that 
satisfies all of the above desiderata. 

Perhaps the most important advance in our present work 
over previous art is our joint feature learning and clustering 
strategy. More specifically, standard pipelines for processing 
extracellular electrophysiology data consist of the following 
steps: (i) filter the raw sensor readings, (ii) perform thresh- 
olding to "detect" the spikes, (Hi) map each detected spike 
to a feature vector, and then (iv) cluster the feature vectors 
pO| . Our primary conceptual contribution to spike sorting 
methodologies is a novel unification of steps (in) and (iv) 
that utilizes all available data in such a way as to satisfy all 
of the the above criteria. This "joint dictionary learning and 
clustering" approach improves results even for single channel 
and single recording experiments. More channels and more 
recordings simply improve our performance. 

A. Previous Art 

Although a comprehensive survey of previous spike sorting 
methods is beyond the scope of this manuscript, below we 
provide a summary of previous work as relevant to the above 
listed desiderata. 

Perhaps those methods that are most similar to ours include 
a number of recent Bayesian methods for spike sorting |[8), 
| |T3| . One can think of our method as a direct extension of 
theirs with a number enhancements. Most importantly, we 
learn features for clustering, rather than simply using principal 
components. We also incorporate multiple electrodes, assume 
a more appropriate prior over the number of clusters, and 
address longitudinal data. 

Other popular methods utilize principal components anal- 
ysis (PCA) | 20) or wavelets |19| to find low-dimensional 
representations of waveforms for subsequent clustering. These 
methods typically require some manual tuning, for exam- 
ple, to choose the number of retained principal components. 
Moreover, these do not naturally handle missing data very 
well. Finally, these methods do not choose low-dimensional 
embeddings appropriate for downstream clustering, rather, 
they just "hope" that their low-dimensional embeddings do 
not discard valuable discriminating information. 

Calabrese et al. |7| recently proposed a Mixture of Kalman 
Filters (MoK) model to explicitly deal with slow changes 
in waveform shape. This approach also models spike rate 
(and even refractory period), but it does not address our 
other desiderata, perhaps most importantly, utilizing multiple 
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electrodes or longitudinal data. It would be interesting to 
extend this work to utilize learned time-varying dictionaries 
rather than principal components. 

Finally, several recently proposed methods address sparsely 
firing neurons |f2), p2| . By directly incorporating firing rate 
into our model and inference algorithm (see Section [Tl-C| l, our 
approach outperforms previous methods even in the absence 



of manual tuning (see Section III-Ei. 

The remainder of our manuscript is organized as follows. 
Section |ll] begins with a conceptual description of our model 
followed by mathematical details and experimental methods 



for new data. Section III begins by comparing the performance 
of our approach to several other previous state of the art meth- 
ods, and then highlights the utility of a number of additional 
features that our method includes. Section |IV] summarizes 
and provides some potential future directions. The appendix 
provides details of the relationships between our method and 
other related Bayesian models or constructions. 

II. Models and Analysis 
A. Model Concept 

Our generative model derives from knowledge of the proper- 
ties of electrophysiology signals. Specifically, we assume that 
each waveform can be represented by a sparse mixture of sev- 
eral dictionary elements, or features. Rather than presupposing 
a particular form of those features (as in much of harmonic 
analysis, for example, wavelets), we learn these features from 
the data. Importantly, we learn these features for the specific 
task at hand: spike sorting (aka, clustering). This is in contrast 
to other popular feature learning approaches, such as PCA 
or independent component analysis, which learn features to 
optimize a different objective function (for example, minimiz- 
ing reconstruction error). This joint dictionary learning and 
clustering is a powerful idea with demonstrably good perfor- 
mance in a number of applications |38|. Moreover, statistical 
guarantees associated with such approaches are beginning to 
be understood |24|. Section jll-Bl provides mathematical details 
for our Bayesian dictionary learning assumptions. 

The above generative model requires a prior on the number 
of clusters. We assume a flexible prior motivated by the 
data. Specifically, regardless of the number of putative spikes 
detected, the number of different single units one could 
conceivably discriminate from a single electrode is upper 
bounded due to the conductive properties of the tissue. Thus, 
Bayesian non-parametric methods that enable the number of 
clusters to increase to infinity as the number of threshold 
crossings increases are inappropriate. Moreover, our prior is 
also appropriate for chronic recordings, in which single units 
may appear for a subset of the recording days, but also 
disappear and reappear intermittently. We refer to this prior as 
part of a mixture model a "focused mixture model" (FMM). 



Sections II-C and II-D provide mathematical details for the 
general mixture modeling case, and our specific focused 
mixture model assumptions. 

We are also interested in multichannel recordings. When 
we have multiple channels that are within close proximity to 
one another, we can "borrow strength" across the channels to 



improve clustering accuracy. Moreover, we can ascertain that 
certain movement or other artifacts that look like spikes on 
a single channel, are in fact noise. For recording in awake 
behaving animals, such artifacts can be quite common. 

Finally, we explicitly model the spike rate of each cluster 
This can help address refractory issues, and perhaps more 
importantly, enables us to detect sparsely firing neurons with 
high accuracy. 

Because our model is fully Bayesian, we can easily generate 
data from our model to address missing data. Moreover, by 
placing relatively diffuse but informed hyperpriors on our 
model, our approach does not require any manual tuning. And 
by reformulating our priors, we can derive (local) conjugacy 



which admits efficient Gibbs sampling. Section II-E provides 
details on these computations. 



B. Bayesian dictionary learning 

Consider electrophysiological data measured over a pre- 
scribed time interval. Specifically, let X^^ e M^^^ represent 
the f^ signal observed during interval i (note that each j 
indexes a threshold crossing within a time interval i; we do 
not indicate this dependency notationally for brevity). The data 
are assumed recorded on each of N channels, from an A^- 
element sensor array, and there are T time points associated 
with each detected spike waveform (the signals are aligned 
with respect to the peak energy of all the channels). In tetrode 
arrays ITTI, and related devices like those considered below, 
a single-unit event (action potential of a neuron) may be 
recorded on multiple adjacent channels, and therefore it is of 
interest to process the N signals associated with X^^ jointly; 
the joint analysis of all N signals is also useful for longitudinal 
analysis, discussed in Section [III] 

To constitute data X^j, we assume that threshold-based 
detection (or a related method) is performed on data measured 
from each of the N sensor channels. When a signal is detected 
on any of the channels, coincident data are also extracted 
from all N channels, within a window of (discretized) length 
T centered at the detection peak. On some of the channels 
data may be associated with a single-unit event, and on other 
channels the data may represent background noise. Both types 
of data (signal and noise) are modeled jointly, as discussed 
below. 

Following |[8), we employ dictionary learning to model each 
Xij; however, unlike |8| we jointly employ dictionary learning 
to all N channels in Xy (rather than separately to each of the 
channels). The data are represented 



Xij — DASjj 



E,: 



(1) 



where D G M^^^ represents a dictionary with K dictionary 
elements (columns), A G M.^'^^ is a diagonal matrix with 
sparse diagonal elements, S^ G M^^^ represents the dic- 
tionary weights (factor scores), and Eij G R-^^^ represents 
residual/noise. Let D = (di, . . . , (Ik) and E = (ei, . . . , e^), 
with dfc, e„ G M"'". We impose priors 



dk ^ N{0, ^It) , e„ 



.AA(0,diag(77r\...,?7T')): (2) 
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where It is the TxT dimensional identity matrix and rjt eM. 
for all t. 

We wish to impose that each column of X^^ lives in a linear 
subspace, with dimension and composition to be inferred. 
The composition of the subspace is defined by a selected 
subset of the columns of D, and that subset is defined by 
the non-zero elements in the diagonal of A = diag(A), with 
A 



{Xi, . . . , Xk) and Xk € M. for all k. We impose 
i^Sq + (1 — i^)J\f^{0,aQ^), with i/ ^ Beta(ao,fco) ^nd 



Sq a unit measure concentrated at zero. The hyperparameters 
Co, 60 G ]R are set to encourage sparse A, and JV+{-) represents 
a normal distribution truncated to be non-negative. Diffuse 
gamma priors are placed on {i]t} and ao. 

Concerning the model priors, the assumption d^ ~ 
Af{0, ^It) is consistent with a conventional £2 regulariza- 
tion on the dictionary elements. Similarly, the assumption 
e„ ~ Af{0,dmg{r]^^,...,7]j^^)) corresponds to an £2 fit of 
the data to the model, with a weighting on the norm as a 
function of the sample point (in time) of the signal. These 



priors are typically employed in dictionary learning; see |36| 



for a discussion of the connection between such priors and 
optimization-based dictionary learning. 

C. Mixture modeling 

A mixture model is imposed for the dictionary weights 



{s 



ijl, ■ 



,Si.N), with s„„ e 



s,o„ defines the 



weights on the dictionary elements for the data associated with 
the nth channel (nth column) in X^^. Specifically, 



^ijn 



'Af{tiz„,i,i^l 



^M 



EJ" (Of 

m=l "''" "■• 



(/X„i„ , n,nn ) -- Go (a*0 , /3o , W^O , I'D ) 



(4) 



where Go is a normal-Wishart distribution with fi^ ?l K 
dimension vector of zeros, /?o = !> W^o is a iiT dimensional 
identity matrix, and fo = K. The other parameters: 7r„/ > 0, 

Sm=i """i ~ ^' ^^'^ {sijn\n=i,N ^rc all associatcd with 
cluster Zij-, Zij e {1, . . . , M} is an indicator variable defining 
with which cluster X^ is associated, and M is a user-specified 
upper bound on the total number of clusters possible. 

The use of the Gaussian model in (|3]l is convenient, as 
it simplifies computational inference, and the normal-Wishart 
distribution Go is selected because it is the conjugate prior for 
a normal distribution. The key novelty we wish to address in 
this paper concerns design of the mixture probability vector 
7rW = (4^...,7r«)^. 



D. Focused Mixture Model 

The vector tt*^*' defines the probability with which each of 
the AI mixture components are employed for data recording 
interval i. We wish to place a prior probability distribution 
on ■K^''\ and to infer an associated posterior distribution 
based upon the observed data. Let 6„V be a binary variable 
indicating whether interval i uses mixture component m. Let 
(j)m correspond to the relative probability of including mixture 
component m in interval i, which is related to the firing rate 
of the single-unit corresponding to this cluster during that 



interval. Given this, we have our prior over clusters: 

where Z = X]m'=i "rn'%n' i^ '■h^ normalizing constant to 
ensure that ^^ -Km = 1. To finalize this parameterization, 
we further assume the following priors on 6„V and (pi ■ 

cj).ra ^ Ga(7o, 1), Pi ^ Beta(ao, 60) (6) 



6«~BernK„), 

u^ - Beta(a/M, 1), 70 - Ga(co, 1/do) (7) 

where Ga(-) denotes the gamma distribution, and Bern(-) the 
Bernoulli distribution. Note that {(/)m, ^'m}r?j=i.A/ are shared 
across all intervals i, and it is in this manner we achieve joint 
clustering across all time intervals. The reasons for the choices 
of these various priors is discussed in Section |IV-B| when 
making connections to related models. For example, the choice 
hm ^ Bern(t'„i) with v„i ^ Beta(a/M, 1) is motivated by the 
connection to the Indian buffet process | [T5| as M -^ cx). 

Note that we explicitly model the probability of spiking for 
each cluster component in each time interval. This implies that 
data are mapped to the same cluster if they have consistent 
signal shape and if the associated firing rate is consistent 
(note that the firing rates for some individual neurons can vary 
widely — from a few spikes/sec to greater than one hundred — 
but motor neuron firing rates are typically much lower and 
less variable). Consequently both the signal shape and firing 
rate dictates how the data are clustered. 



E. Computations 

The posterior distribution of model parameters is approxi- 
mated via Gibbs sampling. Most of the update equations for 
the model are relatively standard due to conjugacy of consec- 
utive distributions in the hierarchical model; these "standard" 
updates are not repeated here (see \^). Perhaps the most 
important update equation is for (j)m, as we found this to be a 
critical component of the success of our inference. To perform 
such sampling we utilize the following lemma. 

Lemma 11.1. Denote s{n,j) as the Sterling numbers of the 
first kind [18^ and F(n,j) = ( — l)""'"-'s(n, j)/n! as their 
normalized and unsigned representations, with F(0, 0) = 1, 
F(n,0) = if n > 0, F{n,j) ^ if j > n and 



F{n + l,j) = ;j^F(n,j) + ^,F{n,j - I) if I < J < n. 
Assuming n ~ NegBin(^,p) is a negative binomial distributed 
random variable, and it is augmented into a compound Poisson 
representation /ji^ as 



I 
n = ^M;, ui 
1=1 



Logip), £^Pois{-(b\n{l-p)) (8) 



where Log(p) is the logarithmic distribution /pV with probabil- 
ity generating function G{z) = ln(l — pz)/ln(l — p), \z\ < 
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p ^, then we have 
for j = 0, 1,--- ,n. 



j'=i 



F{n,3') 



(9) 



The proof is provided in the Appendix. 

Let the total set of data measured during interval i be 
represented X'i — {Xij} ■ 'j^, where Mi is the total number 
of events during interval i. Let n*„j represent the number 
of data samples in X'i that are apportioned to cluster m E 
{1, . . . , M} = S, with Mi= J2rn=i '^irn- To Sample 0^, since 
p((?!-„|p,n*^) P C n.,h (-;)^i NegBin(n*„;<?!)„,p,)Ga(<?!)„;7o,l) 
(see Appendix |IV-B| for details), using Lemma |II.1| we can 
first sample a latent count variable £i„i for each n*^ as 

Pr(^jm = %*,„> '/'m) = ^0™ «m> 0> ^ = 0, • ' ' , <„■ (10) 

Since £im ^ Pois(— (/i,„ ln(l — Pi)), using the conjugacy 
between the gamma and Poisson distributions, we have 



Ga ( 70 + Ej:6W = i ^^m, 1-^ (.)^ ln(l-p, 



(11) 



Notice that marginalizing out (/)m in f^m ~ Pois{—(f>m ln(l — 
Pi)) results in ^i„ -- NegBin(7o, ^^^^f^), therefore, we 
can use the same data augmentation technique by sampling a 
latent count £,;,„ for each £i,n and then sampling 70 using the 
gamma Poisson conjugacy as 



Ga co + X],,:fc(')=i4 



''''~^.:bW = l 



1 

In (1- 



-In(l-P,) 



Another important parameter is bm . Since 6„V can only 



be zero if n*^ = and when ri*„j 



^(^) 



0, Pr(6;,V = l|-j oc 
NegBin(O;0„,pi)7r„i and Pr(6}A^ = 0|-) ex (1 - 7r,„), we 
have 



k(^) 



Bernoulli 



A large pi thus indicates a large variance-to-mean ratio on n*^^ 
and Adi. Note that when bm = 0, the observed zero count 
T7-*„ = is no longer explained by n*^ ^ NegBin(r,„,pi), 
this satisfies the intuition that the underlying beta-Bernoulli 
process is governing whether a cluster would be used or not, 
and once it is activated, it is ?',„ and pi that control how much 
it would be used. 



F. Data Acquisition and Pre-processing 

In this work we use two datasets, the popular "hc-1" 
dataset yj and a new dataset based upon experiments we 
have performed with freely moving rats (institutional review 
board approvals were obtained). These data will be made 



available to the research community. Six animals were used 
in this study. Each animal was trained, under food restriction 
(15 g/animal/day, standard hard chow), on a simple lever- 
press-and-hold task until performance stabilized and then 
taken in for surgery. Each animal was implanted with four 
different silicon microelectrodes (NeuroNexus Technologies; 
Ann Arbor, MI; custom design) in the forelimb region of 
the primary or supplementary motor cortex. Each electrode 
contains up to 16 independent recording sites, with variations 
in device footprint and recording site position (e.g. Figure 



4(a) I. Electrophysiological data were measured during one- 
hour periods on eight consecutive days, starting on the day 
after implant (data were collected for additional days, but 
the signal quality degraded after 8 days, as discussed below). 
The recordings were conducted in a high walled observation 
chamber under freely-behaving conditions. Note that nearby 
sensors are close enough to record the signal of a single or 
small group of neurons, termed a single-unit event. However, 



in the device in Figure 4(a) all eight sensors in a line are too 



far separated to simultaneously record a single-unit event on 
all eight. 

The data were bandpass filtered (0.3-3 kHz), and then all 
signals 3.5 times the standard deviation of the background 
signal were deemed detections. The peak of the detection was 
placed in the center of a L3 msec window, which corresponds 
to r = 40 samples at the recording rate. The signal Xy G 
jjTxw corresponds to the data measured simultaneously across 
all N channels within this window. Here A^ = 8, with a 
concentration on the data measured from the 8 channels of 



the zoomed-in Figure 4(a) 



III. Results 

For these experiments we used a truncation level of ii' = 40 
dictionary elements, and the number of mixture components 
was truncated to M — 20 (these truncation levels are upper 
bounds, and within the analysis a subset of the possible 
dictionary elements and mixture components are utilized). 
In dictionary learning, the gamma priors for {rjt} and ao 
were set as Ga(10~^, 10~^). In the context of the focused 
mixture model, we set ao — fco — 1, cq = 0.1 and 
do = O.L Prior Ga(10~^, 10~^) was placed on parameter a 
related to the Indian Buffet Process (see Appendix |IV-B| for 
details). None of these parameters have been tuned, and many 
related settings yield similar results. In all examples we ran 
6,000 Gibbs samples, with the first 3,000 discarded as burn- 
in (however, typically high-quality results are inferred with 
far fewer samples, offering the potential for computational 
acceleration). 

A. Real data with partial ground truth 

We first consider publicly available dataser] hc-L These 
data consist of both extracellular recordings and an intracel- 
lular recording from a nearby neuron in the hippocampus of 
an anesthetized rat p6) . Intracellular recordings give clean 
signals on a spike train from a specific neuron, providing 



available from http://crcns.org/data-sets/hc/hc-l 



^available from http://crcns.org/data-sets/hc/hc- 
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accurate spike times for that neuron. Thus, if we detect a 
spike in a nearby extracellular recording within a close time 
period (< 0.5ms) to an intracellular spike, we assume that the 
spike detected in the extracellular recording corresponds to the 
known neuron's spikes. 

For the accuracy analysis, we determine one cluster that 
corresponds to the known neuron. We consider a spike to be 
correctly sorted if it is a known spike and is in the known 
cluster or if it is an unknown spike in an unknown cluster 
Formally, we define: 



Accuracy = s 1 



^p I J^n 



Total number of waveforms 



X 100% 



(13) 

where Fp and i^„ are the total number of false positives and 
negatives, respectively. 

We considered the widely used data d533101 and the 
same preprocessing from |7]. These data consist of 4-channel 
extracellular recordings and 1-channel intracellular recording. 
We used 2491 detected spikes and 786 of those spikes came 
from the known neuron. Accuracy of cluster results based on 
multiple methods are shown in Figure [T] We consider several 
different clustering schemes and two different strategies for 
learning low-dimensional representations of the data. Specif- 
ically, we learn low-dimensional representations using either: 
dictionary learning (DL) or the first two principal components 
(PC). Given this representation, we consider several differ- 
ent clustering strategies: (i) Matrix Dirichlet Process (MDP), 
which implements a DP on the X^^ matrices, as opposed to 
previous DP approaches on vectors |[8), p3) , (ii) focused mix- 
ture model (as described above), (iii) Hierarchical DP (HDP), 
(iv) independent DP (both DPs are from 181), (v) Mixture 
of Kalman filters (MoK) [7], (vi) Gaussian mixture models 
(GMM) [Bl, and (vii) K-means (KMEANS) 1^. Although 
we do not consider all pairwise comparisons, we do consider 
many options. Note that all of the DL approaches are novel. It 
should be clear from Figure [Tlthat dictionary learning enhances 
performance over principal components for each clustering 
approach. Moreover, sharing information across channels, as 
in MDP and FMM (both novel constructions), seems to further 
improve performance. 

In Figure [2] we visualize the waveforms in the first 2 
principle components for comparison. In Figure l2k, we show 
ground truth to compare to the results we get by clustering 
from the K-means algorithm shown in Figure l2V) and the 
clustering from the GMM shown in Figure |2};. We observe 
that both K-means and GMM work well, but due to the 
constrained feature space they incorrectly classify some spikes 
(marked by arrows). However, the proposed model, shown in 
Figure l2|d), which incorporates dictionary learning with spike 
sorting, infers an appropriate feature space (not shown) and 
more effectively clusters the neurons. 

Note that in Figure [T] and l2] in the context of PC A features, 
we considered the two principal components (similar results 
were obtained with the three principal components); when we 
considered the 20 principal components, for comparison, the 
results deteriorated, presumably because the higher-order com- 
ponents correspond to noise. An advantage of the proposed 



approach is that we model the noise explicitly, via the residual 
Eij in (fill; with PC A the signal and noise are not distinguished 
as well. 
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2 92% 
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Fig. 1. Accuracy of the various methods on d533101 data^Wl. 
All abbreviations are explained in the main text (Section |///-A| >. 
Note that dictionary learning dominates performance over principal 
components. Moreover, modeling multiple channels (as in MDP 
and FMM) dominates performance over operating on each channel 
separately. 
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Fig. 2. Clustering results show n in the 2 PC space of the various 
methods on d533101 data \J6 V. All abbreviations are explained in 
the main text (Section \lll-fi\ . Known Neuron " denotes waveforms 
associated with the neuron from the 1-cell intracellular recording. 
Note that all methods are shown in the first two PC for visualization, 
but that the FMM-DL shown in (d) is jointly learning the feature 
space and clustering. 



B. Handling missing data 

The quantity of data acquired by a neural recording system 
is enormous, and therefore in many systems one first performs 
spike detection (for example, based on a threshold), and then 
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a signal is extracted about each detection (a temporal window 
is placed around the peak of a given detection). This step is 
often imperfect, and significant portions of many of the spikes 
may be missing due to the windowed signal extraction (and 
the missing data are not retainable, as the original data are 
discarded). Conventional feature-extraction methods typically 
cannot be applied to such temporally clipped signals. 

Returning to (fTJ, this implies that some columns of the data 
Xij may have missing entries. Conditioned on D, A, Sjj, and 
{m,---iVT), we have X,^ ~ A/'(DASij,diag(?7f \ . . . ,77^;^). 
The missing entries of Xy may be treated as random variables, 
and they are integrated out analytically within the Gaussian 
likelihood function. Therefore, for the case of missing data in 
Xij, we simply evaluate (IT]) at the points of X^j for which data 
are observed. The columns of the dictionary D of course have 
support over the entire signal, and therefore given the inferred 
Sij (in the presence of missing data), one may impute the 
missing components of Xy via DAS.y. As long as, across 
all Xij, the same part of the signal is not clipped away (lost) 
for all observed spikes, by jointly processing all of the retained 
data (all spikes) we may infer D, and hence infer missing data. 

In practice we are less interested in observing the imputed 
missing parts of X^ than we are in simply clustering the 
data, in the presence of missing data. By evaluating X^j ^ 
J\f{T)ASij,dmg{rj^^, . . . ,7j^^) only at points for which data 
are observed, and via the mixture model in (Hll, we directly 
infer the desired clustering, in the presence of missing data 
(even if we are not explicitly interested in subsequently 
examining the imputed values of the missing data). 

To examine the ability of the model to perform clustering 
in the presence of missing data, we reconsider the publicly 
available data from Section IIII-AI For the first 10% of the 
spike signals (300 spike waveforms), we impose that a fraction 
of the beginning and end of the spike is absent. The original 
signals are of length T — AO samples. As a demonstration, 
for the "clipped" signals, the first 10 and the last 16 samples 
of the signals are missing. A clipped waveform example is 



shown in Figure 3(a) we compare the mean estimation of 
the signal, and the error bars reflect one standard deviation 
from the full posterior on the signal. In the context of the 
analysis, we processed all of the data as before, but now with 
these "damaged'Vclipped signals. We observed that 94.11% 
of the non-damaged signals were clustered properly (for the 
one neuron for which we had truth), and 92.33% of the 
damaged signals were sorted properly. The recovered signal 
in Figure |3(a)| is typical, and is meant to give a sense of the 
accuracy of the recovered missing signal. The ability of the 
model to perform spike sorting in the presence of substantial 
missing data is a key attribute of the dictionary-leaming-based 
framework. 



C. Longitudinal analysis of electrophysiological data 



Figure 4(b) a) shows the recording probe used for the 
analysis of the rat motor cortex data. Figure |4(b)| shows 
assignments of data to each of the possible clusters, for data 
measured across the 8 days, as computed by the proposed 
model (for example, for the first three days, two clusters 
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Fig. 3. Our generative model easily addresses missing data, (a) 
Example of a clipped waveform from the publicly available data 
(blue), original waveform (gray) and recovery waveform (black); 
the error bars reflect one standard deviation from the posterior 
distribution on the underlying signal, (b) Relative errors (with respect 
to the mean estimated signal). 



were inferred). Results are shown for the maximum-UkelUiood 
collection sample. As a comparison to FMM-DL, we also 
considered the non-focused mi xture m odel (NFMM-DL) con- 

with the 6(*' set to all 



IV-B 



struction discussed in Section 

ones (in both cases we employ the same form of dictionary 



learning, as in Section II-B 1. From Figure 4(c) it is observed 



that on held-out data the FMM-DL yields improved results 
relative to the NFMM-DL. 

In fact, the proposed model was developed specifically 
to address the problem of multi-day longitudinal analysis 
of electrophysiological data, as a consequence of observed 
limitations of HDP (which are only partially illuminated by 
Figure 4(c) 1. Specifically, while the focused nature of the 
FMM-DL allows learning of specialized clusters that occur 
over limited days, the "non-focused" HDP-DL tends to merge 
similar but distinct clusters. This yields HDP results that are 
characterized by fewer total clusters, and by cluster charac- 
teristics that are less revealing of detailed neural processes. 
Patterns of observed neural activity may shift over a period 
of days due to many reasons, including cell death, tissue 
encapsulation, or device movement; this shift necessitates the 
FMM-DL' s ability to focus on subtle but important differences 
in the data properties over days. This ability to infer subtly 
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different clusters is related to the focused topic model's ability 
| [33| to discern distinct topics that differ in subtle ways. The 
study of large quantities of data (8 days) makes the ability to 
distinguish subtle differences in clusters more challenging (the 
DP-DL-based model works well when observing data from 
one recording session, like in Figure [T] but the analysis of 
multiple days of data is challenging for HDP). 
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Fig. 4. Longitudinal data analysis of the rat motor cortex data, (a) 
Schematic of the neural recording array that was placed in the rat 
motor cortex. The red numbers identify the sensors, and a zoom-in 
of the bottom-eight sensors is shown. The sensors are ordered by the 
order of the read-out pads, at left. The presented data are for sensors 
numbered 1 to 8, corresponding to the zoomed-in region, (b) From 
the maximum-likelihood collection sample, the apportionment of data 
among mixture components (clusters). Results are shown for 45 sec 
recording periods, on each of 8 days. For example, D-4 reflects data 
on day 4. Note that while the truncation level is such that there are 
20 candidate clusters (vertical axis in (b)), only an inferred subset of 
clusters are actually used on any given day. (c) Predictive likelihood 
of held-out data. The horizontal axis represents the fraction of data 
held out during training. FMM-DL dominates NFMM-DL on these 
data. 



60% 



>-40% 



n) 



dI 20% 




4 6 8 10 12 
Number of global clusters 

(a) 



80%o 




28 30 32 34 
Number of dictionary elements 

(b) 



0.5 



CD 
T3 



-0.5 




0.5 1 

Time (msec) 



(c) 



Fig. 5. Posteriors and dictionaries from rat motor cortex data (the 
same data as in Figure Rl. (a) Approximate posterior distribution on 
the number of global clusters (mixture components), (b) Approximate 
posterior distribution of the number of dictionary elements, (c) Exam- 
ples inferred dictionary elements; amplitudes of dictionary elements 
are unit less. 



Note from Figure 4(b) that the number of detected signals 
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is different for different recording days, despite the fact that 
the recording period reflective of these data (45 sees) is the 
same for all days. This highlights the need to allow modeling 
of different firing rates, as in our model but not emphasized 
in these results. 

Among the parameters inferred by the model are approxi- 
mate posterior distributions on the number of clusters across 
all days, and on the required number of dictionary elements. 



Ch-4 Ch-5 Ch-2 Ch-7 Ch-8 Ch-1 Ch-6 Ch-3 



These approximate posteriors are shown in Figures 5(a) and 



5(b) and Figure 5(c) shows example dictionary elements. 



Although not shown for brevity, the {pi } had posterior means 
in excess of 0.9. 

To better represent insight that is garnered from the model. 
Figure l6] depicts the inferred properties of three of the clusters, 
from Day 4 (D-4 in Figure 4(b) i. Shown are the mean signal 
for the 8 channels in the respective cluster (for the 8 channels 
at the bottom of Figure |4(a)| i, and the error bars represent 
one standard deviation, as defined by the estimated posterior. 
Note that the cluster in the top row of Figure l6] corresponds 
to a localized single-unit event, presumably from a neuron 
(or a coordinated small group of neurons) near the sensors 
associated with channels 7 and 8. The cluster in the middle 
row of Figure |6] similarly corresponds to a single-unit event 
situated near the sensors associated with channels 3 and 6. 
Note the proximity of sensors 7 and 8, and sensors 3 and 6, 



from Figure 4(a) The HDP model uncovered the cluster in 
the top row of Figure l6j but not that in the middle row of 
Figure |6] (not shown). 

Note the bottom row of Figure [6] in which the mean signal 
across all 8 channels is approximately the same (HDP also 
found related clusters of this type). This cluster is deemed 
to not be associated with a single-unit event, as the sensors 
are too physically distant across the array for the signal to be 
observed simultaneously on all sensors from a single neuron. 
This class of signals is deemed associated with an artifact or 
some global phenomena, (possibly) due to movement of the 
device within the brain, and/or because of charges that build 
up in the device and manifest signals with animal motion. Note 
that in the top two rows of Figure |6]the error bars are relatively 
tight with respect to the strong signals in the set of eight, 
while the error bars in Figure l6|c) are more pronounced (the 
mean curves look smooth , but this is based upon averaging 
thousands of signals). 

In addition to recording the electrophysiological data, video 
was recorded of the rat throughout the experiment. Robust 
PCA p4) was used to quantify the change in the video from 
frame-to-frame, with high change associated with large motion 
by the animal (this automation is useful because one hour of 
data are collected on each day; direct human viewing is tedious 
and unnecessary). On Day 4, the model infers that in periods 
of high animal activity, 20% to 40% of the detected signals are 
due to single-unit events (depending on which portion of data 
are considered); during periods of relative rest 40% to 70% of 
detected signals are due to single-unit events. This suggests 
that animal motion causes signal artifacts, as discussed in 
Section |I] 

In these studies the total fraction of single-unit events, even 
when at rest, diminishes with increasing number of days from 
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Fig. 6. Example clusters inferred for data on the bottom 8 channels 
of Fig. \4(a)\ (a)-(b) Example of single-unit events, (c) Example of a 
cluster not attributed to a single-unit-event. The 8 signals are ordered 
from left to right consistent with the numbering of the 8 channels at 
the bottom of Figure \4(a)\ The black cur\>es represent the mean, and 
the error bars are one standard deviation. 



sensor implant; this may be reflective of changes in the system 
due to the glial immune response of the brain f5l, f25|. The 
discerning ability of the proposed FMM-DL to distinguish 
subtly different signals, and analysis of data over multiple 
days, has played an important role in this analysis. Further, 
longitudinal analyses like that in Figure [6] were the principal 
reason for modeling the data on all A^ = 8 channels jointly 
(the ability to distinguish single-unit events from anomalies is 
predicated on this multi-channel analysis). 

D. Model tuning 

As constituted in Section lllj the model is essentially pa- 
rameter free. All of the hyperparameters are set in a relatively 
diffuse manner (see the discussion at the beginning of Section 
[nU l, and the model infers the number of clusters and their 
composition with no parameter tuning required. While this 
may generally be viewed as a strength, there are situations 
for which a neuroscientist may wish to favor particular kinds 
of clusterings, and to have an adjustable parameter with 
which different solutions may be considered. All of the results 
presented above were manifested without any model tuning. 
We now discuss how one may constitute a single "knob" (pa- 
rameter) that a neuroscientist may "turn" to examine different 
kinds of results. 



In Section II-B the variance of additive noise (ei, • • • , e„) 
are controlled by the covariance diag(?7]^^, • • • ,rij^^). If we 
set dmg{ri^^ , ■ ■ ■ ,77^^) = cjq^^It, then parameter ujq may be 
tuned to control the variability (diversity) of spikes. The cluster 
diversity encouraged by setting different values of cuo in turn 
manifests different numbers of clusters, which a neuroscientist 
may adjust as desired. As an example, we consider the 



publicly available data from Section III-A and clusterings 
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(color coded) are shown for two settings of cjq in Figure |7] In 
this figure, each spike is depicted in two-dimensional learned 
feature space, taking two arbitrary features (because features 
are not inherently ordered); this is simply for display purposes, 
as here feature learning is done via dictionary learning, and 
in general more than two dictionary components are utilized 
to represent a given waveform. 

The value of ujq defines how much of a given signal is 
associated with noise E^j, and how much is attributed to 
the term DAS^j characterized by a summation of dictionary 
elements (see (1)). If wq is large, then the noise contribution 
to the signal is small (because the noise variance is imposed 
to be small), and therefore the variability in the observed data 
is associated with variability in the underlying signal (and that 
variability is captured via the dictionary elements). Since the 
clustering is performed on the dictionary usage, if ojq is large 
we expect an increasing number of clusters, with these clusters 
capturing the greater diversity/variability in the underlying 
signal. By contrast, if ujq is relatively small, more of the signal 
is attributed to noise Ejj, and the signal components modeled 
via the dictionary are less variable (variability is attributed to 
noise, not signal). Hence, as luq diminishes in size we would 
expect fewer clusters. This phenomenon is observed in the 
example in Figure [7] with this representative of behavior we 
have observed in a large set of experiments on the rat motor 
cortex data. 



E. Sparsely Firing Neurons 

Recently, several manuscripts have directly addressed spike 
sorting in the present of sparsely firing neurons Q, p2^ . Based 
on reviewer recommendations, we assessed the performance 
of FMM-DL in such regimes utilizing the following synthetic 
data. First, we extracted spike waveforms from four clusters 
from the new dataset discussed in Section HI-FI We excluded 
all waveforms that did not clearly separate (Figure |8jal)) 
to obtain clear clustering criteria (Figure I8ta2)). Then, we 
added independent and identically distributed Gaussian noise 
to each waveform at two different levels to obtain increasingly 
noisy and less- well separated clusters (Figure ISjbl), (b2), 
(cl), and (c2)). We applied FMM-DL, Wave-clus |22| and 
Wave-clus "forced" (in which we hand tune the parameters to 
obtain optimal results) and ISOMAP dominant sets |2| to all 
three signal-to-noise ratio (SNR) regimes to assess our relative 
performance with the following results. 

The third column of Figurel8]shows the posterior estimate of 
the number of clusters for each of the three scenarios. As long 
as SNR is relatively good, for example, higher than 2 in this 
simulation, the posterior number of clusters inferred by FMM- 
DL correctly has its maximum at four clusters. Similarly, for 
the good and moderate SNR regimes, the confusion matrix is 
essentially a diagonal matrix, indicating that FMM-DL assigns 
spikes to the correct cluster. Only in the poor SNR regime 
(SNR=1.5), does the posterior move away from the truth. This 
occurs because Unit 1 becomes over segmented, as depicted in 
(c2). (c4) shows that only this unit struggles with assignment 
issues, suggestive of the possibility of a post-hoc correction if 
desired. 
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Fig. 7. Effect of manually tuning ujq to obtain a different number of 
features for the rat motor cortex data, (a) Waveforms projected down 
onto two learned features based on cluster result with cjq = 10 , the 
number of inferred clusters is two. (b) Same as (a) with uo = 10 ; 
the number of inferred clusters is seven. 



Figure l9|a) compares the performance of FMM-DL to 
previously proposed methods. Even after fine-tuning the Wave- 
clus method to obtain its optimal performance on these data, 
FMM-DL yields a better accuracy. In addition to obtaining 
better point-estimates of spiking, via our Bayesian generative 
model, we also obtain posteriors over all random variables 
of our model, including number of spikes per unit. Figure 
|9|b) and (c) show such posteriors, which may be used by the 
experimentalist to assess data quality. 

F. Computational requirements 

The software used for the tests in this paper were writ- 
ten in (non-optimized) Matlab, and therefore computational 
efficiency has not been a focus. The principal motivating 
focus of this study concerned interpretation of longitudinal 



spike waveforms, as discussed in Section III-C for which 



computation speed is desirable, but there is not a need for real- 
time processing (for example, for a prosthetic). Nevertheless, 
to give a sense of the computational load for the model, 
it takes about 20 seconds for each Gibbs sample, when 
considering analysis of 170800 spikes across iV = 8 channels; 
computations were performed on a PC, specifically a Lenevo 
T420 (CPU is Intel(R) Core (TM) i? M620 with 4 GB RAM). 
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Fig. 8. Sparse firing results on synthetic data based on the Pittsburgh dataset. The three rows correspond to three different signal-to noise 
ratio (SNR) levels: (a) 1, (b) 1.5, and (c) 2.5. The four columns correspond to: (1) cluster results of spike waveforms with colors representing 
different clusters, (2) plots of learned features based on cluster result, (3) approximate posterior distribution of cluster numbers, and (4) 
confusion matrix heatmap, Split 1 and Split 2 in (c4) are the clusters split from unit 1. Note that we accurately recover all the sparsely 
spiking neurons except the sparsest one in the noisiest regime. 
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Significant computational acceleration may be manifested by 
coding in C, and via development of online methods for 
Bayesian inference (for example, see |30|). In the context of 
such online Bayesian learning one typically employs approxi- 
mate variational Bayes inference rather than Gibbs sampling, 
which typically manifests significant acceleration |r30J. 

IV. Discussion 

A. Summary 

A new focused mixture model (FMM) has been developed, 
motivated by real-world studies with longitudinal electrophysi- 
ological data, for which traditional methods like the hierarchi- 
cal Dirichlet process have proven inadequate. In addition to 
performing "focused" clustering, the model jointly performs 
feature learning, via dictionary learning, which significantly 
improves performance over principal components. We explic- 
itly model the count of signals within a recording period by 
Pi. The rate of neuron firing constitutes a primary information 
source 1|9|, and therefore it is desirable that it be modeled. 
This rate is controlled here by a parameter (f>m , and this was 
allowed to be unique for each recording period i. 

B. Future Directions 

In future research one may constitute a mixture model 
on 4>m, with each mixture component reflective of a latent 
neural (firing) state; one may also explicitly model the time 
dependence of (pm , as in the Mixture of Kalmans work |J7J. 
Inference of this state could be important for decoding neural 
signals and controlling external devices or muscles. In future 
work one may also wish to expUcitly account for covariates 
associated with animal activity f29\, which may be linked to 
the firing rate we model here (we may regress pi to observed 
covariates). 

In the context of modeling and analyzing electrophysiolog- 
ical data, recent work on clustering models has accounted for 
refractory-time violations |J7), p], 1 13 1, which occur when two 
or more spikes that are sufficiently proximate are improperly 
associated with the same cluster/neuron (which is impossible 
physiologically due to the refractory time delay required for 
the same neuron to re-emit a spike). The methods developed 
in jSl, p3) may be extended to the class of mixture models 
developed above. We have not done so for two reasons: 
{i) in the context of everything else that is modeled here 
(joint feature learning, clustering, and count modeling), the 
refractory-time-delay issue is a relatively minor issue in prac- 
tice; and iii) perhaps more importantly, an important issue is 
that not all components of electrophysiological data are spike 
related (which are associated with refractory-time issues). As 



demonstrated in Section III a key component of the proposed 
method is that it allows us to distinguish single-unit (spike) 
events from other phenomena. 

Perhaps the most important feature of spike sorting meth- 
ods that we have not explicitly included in this model is 
"overlapping spikes" Q, Q, |[12), fTT), ||28), ll3l|, 1(35). 
Preliminary analysis of our model in this regime (not shown), 
inspired by reviewer comments, demonstrated to us that while 
the FMM-DL as written is insufficient to address this issue. 



a minor modification to FMM-DL will enable "demixing" 
overlapping spikes. We are currently pursuing this avenue. 
Neuronal bursting — which can change the waveform shape of 
a neuron — is yet another possible avenue for future work. 
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Appendix 

A. Connection to Bayesian Nonparametric Models 

The use of nonparametric Bayesian methods like the Dirich- 
let process (DP) (S), 1 13 1 removes some of the ad hoc character 



of classical clustering methods, but there are other limitations 
within the context of electrophysiological data analysis. The 
DP and related models are characterized by a scale parameter 
a > 0, and the number of clusters grows as 0{a\ogS) 126^, 
with S the number of data samples. This growth without limit 
in the number of clusters with increasing data is undesirable in 
the context of electrophysiological data, for which there are 
a finite set of processes responsible for the observed data. 
Further, when jointly performing mixture modeling across 
multiple tasks, the hierarchical Dirichlet process (HDP) LZTl 
shares all mixture components, which may undermine infer- 
ence of subtly different clusters. 

In this paper we integrate dictionary learning and clustering 
for analysis of electrophysiological data, as in IS], p4). 
However, as an alternative to utilizing a method like DP or 
HDP fSl, fT3l for clustering, we develop a new hierarchical 
clustering model in which the number of clusters is modeled 
explicitly; this implies that we model the number of underlying 
neurons — or clusters — separately from the firing rate, with 
the latter controlling the total number of observations. This 



is done by integrating the Indian buffet process (IBP) |15| 



with the Dirichlet distribution, similar to [33l, but with unique 
characteristics. The IBP is a model that may be used to learn 
features representative of data, and each potential feature is 
a "dish" at a "buffet"; each data sample (here a neuronal 
spike) selects which features from the "buffet" are most 
appropriate for its representation. The Dirichlet distribution is 
used for clustering data, and therefore here we jointly perform 
feature learning and clustering, by integrating the IBP with 
the Dirichlet distribution. The proposed framework explicitly 
models the quantity of data (for example, spikes) measured 
within a given recording interval. To our knowledge, this is the 
first time the firing rate of electrophysiological data is modeled 
jointly with clustering and jointly with feature/dictionary 
learning. The model demonstrates state-of-the-art clustering 
performance on publicly available data. Further, concerning 
distinguishing single-unit-events, we demonstrate how this 
may be achieved using the FMM-DL method, considering new 
measured (experimental) electrophysiological data. 
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B. Relationship to Dirichlet priors 

A typical prior for tt''^ is a symmetric Dirichlet distribution 

7r(') - Dir(5o/M, . . . , ao/M). (14) 

In the limit, M — ^ oo, this reduces to a draw from a Dirichlet 
process (sj, |[l3|, represented tt^*) -- DP(aoG'o), with Gq 
the "base" distribution defined in (HI). Rather than drawing 
each TT^'^ independently from DP(aoGo)j we may consider 
the hierarchical Dirichlet process (HDP) pT) as 

tt'*) - DP(5iG) , G-DP(<5oGo) (15) 

The HDP construction imposes that the {tt^'^} share the 
same set of "atoms" {/imn,S^mn}, implying a sharing of 
the different types of clusters across the time intervals i at 
which data are collected. A detailed discussion of the HDP 
formulation is provided in ['SI. 

These models have limitations in that the inferred number 
of clusters grows with observed data (here the clusters are 
ideally connected to neurons, the number of which will not 
necessarily grow with longer samples). Further, the above 
clustering model assumes the number of samples is given, 
and hence is not modeled (the information-rich firing rate 
is not modeled). Below we develop a framework that yields 
hierarchical clustering like HDP, but the number of clusters 
and the data count (for example, spike rate) are modeled 
explicitly. 



with one of the clusters (to know not just how many members 
of X>i are apportioned to a given cluster, but also which data 
are associated with a given cluster). Toward this end, consider 
the alternative equivalent generative process for {n*,-^}m=i,M 
(see Lemma 4.1 in pTJ for a proof of equivalence): first draw 



M 



A/,- Poisson(X;,„=i fr 



^(') 



("Ii 



iM 



, and then 

Mult(M,;^(*\ 






.(*) 



WIW 



a(») 






WiW 



(16) 
(17) 

{4>m\, {bm}, and {pi\ constituted as in (roli- 



/E™'=i^! 



with 

(|7|. Note that we have M, 

"{i) 



^M 



NegBin(X;™=iCVm,K) by 



W, 



marginalizing out 
Rather than 

(i) 



'A// 



drawing 



\n 



for each 



of the M, 



data we 
,„ 8ra, where 



Muh(Af,;7ri' 

may draw indicator variables Zij ^ Tlim=i ^ 

5m is a unit measure concentrated at the point m. Variable 

Zij assigns data sample j e {!,..., Mi] to one of the M 

possible clusters, and n*^ = X]i=i ^{^tj = '^)' with l(-) 

equal to one if the argument is true, and zero otherwise. The 

probability vector tt'*) defined in \\l\ is now used within the 

mixture model in Q. 

As a consequence of the manner in which (jrrn is drawn in 
(|6|l, and the definition of tt*^'^ in ( 17 1, for any pi e (0, 1), the 
proposed model imposes 



rW 



Dir(&i 



(^), 



.^m^'m) 



(18) 



C. Other Formulations of the FMM 

Let the total set of data measured during interval i be 
represented X>,j ~ {Xy} -^i^ where Mi is the total number of 
events during interval i. In the experiments below, a "recording 
interval" corresponds to a day on which data were recorded 
for an hour (data are collected separately on a sequence of 
days), and the set {Xi^} Jj^ defines all signals that exceeded a 
threshold during that recording period. In addition to modeling 
Mi, we wish to infer the number of distinct clusters Ci 
characteristic of X>i, and the relative fraction (probability) with 
which the Mi observations are apportioned to the Ci clusters. 

Let n*„j represent the number of data samples in X>i that 
are apportioned to cluster m £ {!,..., M} — S, with Mi— 
J2m=i''^im- The set Si C S, with d — \Si\, defines the 
active set of clusters for representation of T>i, and therefore 
M serves as an upper bound (n* = for m £ S \ Si). 

We impose n*„j ^ Poisson(6m 0m ) with the priors for 
bm and i^m given in Eqs. (|6]l and Q. Note that n*^ = 
when bm — 0, and therefore 6'^*-' = {b]^ , . . . ,bll)'^ defines 
indicator variables identifying the active subset of clusters 
Si for representation of X>i. Marginalizing out 0™ , n*„j ^ 
NegBm{bm 4'm,Pi)- This emphasize another motivation for the 
form of the prior: the negative binomial modeling of the counts 
(firing rate) is more flexible than a Poisson model, as it allows 
the mean and variance on the number of counts to be different 
(they are the same for a Poisson model). 

While the above construction yields a generative process 
for the number, n*„j, of elements of X>i apportioned to cluster 
TO, it is desirable to explicitly associate each member of X>i 



D. Additional Connections to Other Bayesian Models 

Eq. ( fTSJ l demonstrates that the proposed model is a gen- 
eralization of (14 1. Considering the limit M — > oo, and 
upon marginalizing out the {vm}, the binary vectors {b^^^} 
are drawn from the Indian buffet process (IBP), denoted 
5(0 r^ IBP(a). The number of non-zero components in each 
b*^') is drawn from Poisson(a), and therefore for finite a 
the number of non-zero components in 5^*^ is finite, even 
when M -> 00. Consequently Dir{b]^ (pi, . . . ,bll(pM) is 
well defined even when Af — > 00 since, with probability 
one, there are only a finite number of non-zero parameters 
in {b]^ (pi, . . . ,b]^l(j)]\{). This model is closely related to the 



compound IBP Dirichlet (CID) process developed in p3) , with 
the following differences. 

Above we have explicitly derived the relationship between 
the negative binomial distribution and the CID, and with this 
understanding we recognize the importance of p^; the CID 
assumes pi — 1/2, but there is no theoretical justification for 
this. Note that Mi^ NegBin(^^^j^ 6™ 0„V,Pi). The mean 
of M, is (I]m=i ^™Vm)K/(l ~ Pt), and the variance is 
(Z]m=i^™Vm)pj/(l - Pi)"^- If Pi is fixed to be 1/2 as in 
|33l, this implies that we believe that the variance is two 
times the mean, and the mean and variance of Mi are the 
same for all intervals i and i' for which b^^'> — b^'' \ However, 
in the context of electrophysiological data, the rate at which 
neurons fire plays an important role in information content 
1)9 J. Therefore, there are many cases for which intervals i and 
i' may be characterized by firing of the same neurons (i.e., 
b(') = 6(^')) but with very different rates (Mi ^ Mi,). The 
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modeling flexibility imposed by inferring pi therefore plays 
an important practical role for modeling electrophysiological 
data, and likely for other clustering problems of this type. 

To make a connection between the proposed model 
and the HDP, motivated by (p]l-(P]l, consider 4> = 
(01,- •• ,^j\/) ^ Dir(7o,--- ,7o), which corresponds to 
(01, . . . ,0M)/X]m'=i '/'m'- From we yield a normalized 
form of the vector = (0i,...,0m)- The normalization 
constant X]m=i 4'm is lost after drawing cj); however, be- 
cause 0„j ~ Ga(7o, 1), we may consider drawing ai ^ 
Ga(M7o,l), and approximating 4> ~ cti4>- With this ap- 
proximation for 4>, TT*^*^ may be drawn approximately as 
7r(*) ~ 'Dvi:[aiol' (j>i, . . . , cii&)y^0A/)- This yields a simplified 



and approximate hierarchy 



ttW -Dir((5i(bW0(?!))) (19) 

4> = {(f>i, ■ ■ ■ Am) ^ Dir(7o, • • ' ,7o), "i ^ Ga(Af7o, 1) 

with b*^*^ ^ IBP(a) and representing a pointwise/Hadamard 
product. If we consider 70 — ao/M, and the limit AI ^■ 
cx), with b'*' all ones, this corresponds to the HDP, with 
di ~ Ga(do, 1). We call such a model the non-focused 
mixture model (NFMM). Therefore, the proposed model is 
intimately related to the HDP, with three differences: (i) pi is 
not restricted to be 1/2, which adds flexibility when modeling 
counts; (ii) rather than drawing cf) and the normalization 
constant Ofi separately, as in the HDP, in the proposed model 
cf) is drawn directly via 0^ ~ Ga(7o, 1), with an explicit link 
to the count of observations Mi ~ NegBin(^^jj^j^ bm4'm,Pi)\ 
and {Hi) the binary vectors 6^') "focus" the model on a sparse 
subset of the mixture components, while in general, within 
the HDP, all mixture components have non-zero probability 
of occurrence for all tasks i. As demonstrated in Section [III] 
this focusing nature of the proposed model is important in the 
context of electrophysiological data. 



E. Proof of Lemma 3.1 

Proof Denote Wj — X]!?=i ^/^ J = I7 ' ■ ' > "^- Since Wj is 
the summation of j iid Log(p) distributed random variables, 
the probability generating function of Wj can be expressed as 

Gw {z) = [ln(l — pz)/\n{l — p)]'' , \z\ < p^^, thus we have 

PrK- =m) = G|;r/(0)/m! = £L [\n{l - pz) /\n{l - p)y 

- (-1)"V j!s(m, j)/[ln(l -p)]J" (20) 

where we use the property that [ln(l+a;)]^ = j! X^i^? '^ n] 
[ fTSJ . Therefore, we have 

Pt{£ — j|— ) ex Pi{wj ~ n)Pois(j; — r ln(l — p)) 

ex (-l)"+^'s(n,j)/"!^^ = F{n,j)r^. (21) 

■ 
The values F{n,j) can be iteratively calculated and each 
row sums to one, e.g., the 3rd to 5th rows are 

2/3! 3/3! 1/3! •• 

6/4! 11/4! 6/4! 1/4! •• 
24/5! 50/5! 35/5! 10/5! 1/5! •• 



To ensure numerical stability when (p > 1, we may also 
iteratively calculate the values of R^{n^i). 

References 

[1] Dimitrios A Adamos, Nikolaos A Laskaris, Efstratios K Kosmidis, and 
George Theophilidis. NASS: an empirical approach! to spike sorting 
with overlap resolution based on a hybrid noise-assisted methodology. 
Journal of neiiroscience methods, 190(1):129^2, June 2010. 
[2] Dimitrios A Adamos, Nikolaos A Laskaris, Efstratios K Kosmidis, and 
George Theophilidis. In quest of the missing neuron: spike sorting 
based on dominant-sets clustering. Computer methods and programs in 
biomedicine, 107(l):28-35, July 2012. 
[3] F. J. Anscombe. The statistical analysis of insect counts based on the 

negative binomial distribution. Biometrics, 1949. 
[4] Izhar Bar-Gad, Ya'acov Ritov, Eilon Vaadia, and Hagai Bergman. Failure 
in identification of overlapping spikes from multiple neuron activity 
causes artificial correlations. Journal of Neuroscience Methods, 107(1- 
2):1-13, May 2001. 
[5] R. Biran, D.C. Martin, and RA. Tresco. Neuronal cell loss accompanies 
the brain tissue response to chi'onically implanted silicon microelectrode 
arrays. Exp. Neurol, 2005. 
[6] Christopher M. Bishop. Pattern Recognition and Machine Learning. 

Springer, New York, 2006. 
[7] A. Calabrese and L. Paniski. Kalman filter mixture model for spike 

sorting of non-stationary data. J. Neuroscience Methods, 2010. 
[8] B. Chen, D.E. Carlson, and L. Carin. On the analysis of multi-channel 

neural spike data. In NIPS, 2011. 
[9] J.R Donoghue, A. Nurmikko, M. Black, and L.R. Hochberg. Assistive 
technology and robotic control using mortor cortex ensemble-based 
neural interface systems in humans with tetraplegia. / Physiol., 2007. 

[10] Gaute T EinevoU, Felix Franke, Espen Hagen, Christophe Pouzat, and 
Kenneth D Harris. Towards reliable spike-train recordings from thou- 
sands of neurons with multielectrodes. Current opinion in neurobiology, 
22(l):ll-7, February 2012. 

[II] A.A. Emondi, S.R Rebrik, A.V. Kurgansky, and K.D. Miller. Tracking 
neurons recorded from tetrodes across time. /. Neuro. Meth., 2004. 

[12] Felix Franke, Michal Natora, Clemens Boucsein, Matthias H J Munk, 
and Klaus Obermayer. An online spike detection and spike classification 
algorithm capable of instantaneous resolution of overlapping spikes. 
Journal of computational neuroscience, 29(1-2):127^8, August 2010. 

[13] J. Gasthaus, F. Wood, D. Gorur, and Y.W. Teh. Dependent Dirichlet 
process spike sorting. In Advances in Neural Information Processing 
Systems, 2009. 

[14] D. Gorur, C. Rasmussen, A. Tolias, F. Sinz, and N. Logothetis. Mod- 
elling spikes with mixtures of factor analysers. Pattern Recognition, 
2004. 

[15] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and 
the Indian buffet process. In NIPS, 2005. 

[16] D. A. Henze, Z. Borhegyi, J. Csicsvari, A. Mamiya, K. D. Harris, and 
G. Buzsaki. Intracellular feautures predicted by extracellular recordings 
in the hippocampus in vivo. / Neurophysiology, 2010. 

[17] Joshua A Herbst, Stephan Gammeter, David Ferrero, and Richard H R 
Hahnloser. Spike sorting with hidden Markov models. Journal of 
neuroscience methods, 174(l):126-34, September 2008. 

[18] N.L. Johnson, A.W. Kemp, and S. Kotz. Univariate Discrete Distribu- 
tions. John Wiley & Sons, 2005. 

[19] J. C. Letelier and P. P. Weber. Spike sorting based on discrete wavelet 
transform coefficients. /. Neuroscience Methods, 2000. 

[20] M. S. Lewicki. A review of methods for spike sorting: the detection 
and classification of neural action potentials. Network: Computation in 
Neural Systems, 1998. 

[21] Miguel A L Nicolelis and Mikhail A Lebedev. Principles of neural en- 
semble physiology underlying the operation of brain-machine interfaces. 
Nature Reviews Neuroscience, 10(7):530-540, 2009. 

[22] Carlos Pedreira, Juan Martinez, Matias J Ison, and Rodrigo Quian 
Quiroga. How many neurons can we see with current spike sorting 
algorithms? Journal of neuroscience methods, 211(l):58-65, October 
2012. 

[23] F Rieke, D Warland, R De Ruyter Van Steveninck, and W Bialek. Spikes: 
Exploring the Neural Code, volume 20. MIT Press, 1997. 

[24] Daniel A. Spielman, Huan Wang, and John Wright. Exact Recovery of 
Sparsely-Used Dictionaries. June 2012. 

[25] D.H. Szarowski, M.D. Andersen, S. Retterer, A.J. Spence, M. Isaacson, 
H.G. Craighead, J.N. Turner, and W. Shain. Brain responses to micro- 
machined silicon devices. Brain Res., 2003. 



IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING 



[26] Y. W. Teh. Dirichlet processes. In Encyclopedia of Machine Learning. 
Springer, 2010. 

[27] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical 
dirichlet processes. Journal of the American Statistical Association, 
pages 101:1566-1581, 2006. 

[28] Carlos Vargas-Irwin and John P Donoghue. Automated spike sorting 
using density grid contour clustering and subtractive waveform decom- 
position. Journal of neuroscience methods, 164(1):1-18, August 2007. 

[29] V. Ventura. Automatic spike sorting using tuning information. Neural 
Computation, 2009. 

[30] C. Wang, J. Paisley, and D. Blei. Online variational inference for the 
hierarchical Dirichlet process. Artificial Intelligence and Statistics, 201 1. 

[31] X. Wang and A. McCallum. Topics over time: a non-markov continuous- 
time model of topical trends. The Twelfth ACM SIGKDD International 
Conference on Knowledge Discovery and Data Mining, pages 424-433, 
2006. 

[32] Bruce C Wheeler. Multi-unit neural spike sorting. Annals of Biomedical 
Engineering, 19(5), 1991. 

[33] S. Williamson, C. Wang, K.A. Heller, and D.M. Blei. The IBP compound 
Dirichlet process and its application to focused topic modeling. In ICML, 
2010. 

[34] J. Wright, Y. Peng, Y Ma, A. Ganesh, and S. Rao. Robust principal 
component analysis: Exact recovery of corrupted low-rank matrices by 
convex optimization. In NIPS, 2009. 

[35] J. Zhang, Z. Ghahramani, and Y Yang. A probabilistic model for online 
document clustering with application to novelty detection. Proceedings 
of Neural Information Processing Systems, 2004. 

[36] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, 
G. Sapiro, and L. Carin. Nonparametric bayesian dictionary learning for 
analysis of noisy and incomplete images. IEEE Trans. Image Processing, 
2012. 

[37] M. Zhou, L.A. Hannah, D.B. Dunson, and L. Carin. Beta-negative 
binomial process and Poisson factor analysis. In AISTATS, 2012. 

[38] Mingyuan Zhou, Haojun Chen, John Paisley, Lu Ren, Lingbo Li, 
Zhengming Xing, David B. Dunson, Guillermo Sapiro, Lawrence Carin, 
and Senior Member. Nonparametric Bayesian Dictionary Learning for 
Analysis of Noisy and Incomplete Images. Image (Rochester, N.Y.), 
21(1):130-144, 2012. 



